
    Representation Policy Iteration

    This paper addresses a fundamental issue central to approximation methods for solving large Markov decision processes (MDPs): how to automatically learn the underlying representation for value function approximation. A novel, theoretically rigorous framework is proposed that automatically generates geometrically customized orthonormal sets of basis functions, which can be used with any approximate MDP solver such as least-squares policy iteration (LSPI). The key innovation is a coordinate-free representation of value functions, using the theory of smooth functions on a Riemannian manifold. Hodge theory yields a constructive method for generating basis functions for approximating value functions, based on the eigenfunctions of the self-adjoint (Laplace-Beltrami) operator on manifolds. In effect, this approach performs a global Fourier analysis on the state-space graph to approximate value functions, where the basis functions reflect the large-scale topology of the underlying state space. A new class of algorithms called Representation Policy Iteration (RPI) is presented that automatically learns both basis functions and approximately optimal policies. Illustrative experiments compare the performance of RPI with that of LSPI using two hand-coded basis functions (RBF and polynomial state encodings). Comment: Appears in Proceedings of the Twenty-First Conference on Uncertainty in Artificial Intelligence (UAI 2005).
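
    The construction described here (eigenfunctions of the graph Laplacian over the state-space graph, used as basis functions for a solver such as LSPI) can be sketched in a few lines. The following is a minimal illustration under assumed simplifications, not the paper's RPI implementation: a hypothetical 10x10 grid world stands in for the sampled state space, and the number of basis functions k = 20 is an arbitrary choice.

```python
import numpy as np
from scipy.linalg import eigh
from scipy.sparse.csgraph import laplacian

def laplacian_basis(adjacency, k):
    """Return the k smoothest eigenvectors of the normalized graph Laplacian.

    adjacency : (n_states, n_states) symmetric adjacency matrix of the state-space graph
    k         : number of basis functions to keep
    """
    L = laplacian(adjacency, normed=True)        # L = I - D^{-1/2} W D^{-1/2}
    eigvals, eigvecs = eigh(L)                   # eigenvalues in ascending order
    # Low-order eigenvectors vary slowly over the graph and capture its
    # large-scale topology, acting as a Fourier-like basis on the state space.
    return eigvecs[:, :k]

# Hypothetical example: a 10x10 grid world with 4-connected states.
side = 10
n_states = side * side
W = np.zeros((n_states, n_states))
for i in range(side):
    for j in range(side):
        s = i * side + j
        if i + 1 < side:                         # neighbor below
            W[s, s + side] = W[s + side, s] = 1.0
        if j + 1 < side:                         # neighbor to the right
            W[s, s + 1] = W[s + 1, s] = 1.0

Phi = laplacian_basis(W, k=20)                   # feature matrix usable by LSPI / LSTD
print(Phi.shape)                                 # (100, 20)
```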

    Manifold Alignment using Procrustes Analysis

    In this paper we introduce a novel approach to manifold alignment, based on Procrustes analysis. Our approach differs from semi-supervised alignment in that it results in a mapping that is defined everywhere (when used with a suitable dimensionality reduction method) rather than just on the training data points. We describe and evaluate our approach both theoretically and experimentally, providing results showing useful knowledge transfer from one domain to another. Novel applications of our method, including cross-lingual information retrieval and transfer learning in Markov decision processes, are presented.
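
    The core Procrustes alignment step can be illustrated as follows. This is a minimal sketch, not the authors' code: it assumes both domains have already been mapped into a common low-dimensional space by some dimensionality reduction method and that corresponding training pairs are available; the toy data and function name are hypothetical.

```python
import numpy as np

def procrustes_align(X, Y):
    """Find scale k and orthogonal Q minimizing ||Xc - k * Yc Q||_F,
    where Xc, Yc are the mean-centered versions of the paired embeddings X, Y.

    X, Y : (n_pairs, d) low-dimensional embeddings of corresponding points
           from the two domains (output of some dimensionality reduction step).
    A new point y from the second domain is mapped into the first domain's
    space as k * (y - Y.mean(0)) @ Q + X.mean(0), so the mapping is defined
    everywhere, not only on the training pairs.
    """
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    M = Yc.T @ Xc                      # cross-covariance between the domains
    U, s, Vt = np.linalg.svd(M)
    Q = U @ Vt                         # optimal orthogonal alignment
    k = s.sum() / np.trace(Yc.T @ Yc)  # optimal isotropic scaling
    return k, Q

# Toy usage: the second domain is a transformed, scaled, noisy copy of the first.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
R, _ = np.linalg.qr(rng.normal(size=(3, 3)))            # random orthogonal transform
Y = 2.0 * X @ R + 0.01 * rng.normal(size=X.shape)
k, Q = procrustes_align(X, Y)
mapped = k * (Y - Y.mean(0)) @ Q + X.mean(0)
print(np.linalg.norm(X - mapped) / np.linalg.norm(X))   # small relative error
```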

    Randomized and Deterministic Attention Sparsification Algorithms for Over-parameterized Feature Dimension

    Large language models (LLMs) have shown their power in different areas. Attention computation, as an important subroutine of LLMs, has also attracted interest in theory. Recently the static computation and dynamic maintenance of the attention matrix have been studied by [Alman and Song 2023] and [Brand, Song and Zhou 2023] from both the algorithmic and the hardness perspective. In this work, we consider the sparsification of the attention problem. We make one simplifying assumption: the logit matrix is symmetric. Let $n$ denote the length of the sentence and let $d$ denote the embedding dimension. Given a matrix $X \in \mathbb{R}^{n \times d}$, suppose $d \gg n$ and $\| X X^\top \|_{\infty} < r$ with $r \in (0,0.1)$; we then aim to find $Y \in \mathbb{R}^{n \times m}$ (where $m \ll d$) such that \begin{align*} \| D(Y)^{-1} \exp( Y Y^\top ) - D(X)^{-1} \exp( X X^\top ) \|_{\infty} \leq O(r). \end{align*} We provide two results for this problem. $\bullet$ Our first result is a randomized algorithm. It runs in $\widetilde{O}(\mathrm{nnz}(X) + n^{\omega})$ time, has $1-\delta$ success probability, and chooses $m = O(n \log(n/\delta))$. Here $\mathrm{nnz}(X)$ denotes the number of non-zero entries in $X$, and $\omega$ denotes the exponent of matrix multiplication; currently $\omega \approx 2.373$. $\bullet$ Our second result is a deterministic algorithm. It runs in $\widetilde{O}(\min\{\sum_{i\in[d]}\mathrm{nnz}(X_i)^2, dn^{\omega-1}\} + n^{\omega+1})$ time and chooses $m = O(n)$. Here $X_i$ denotes the $i$-th column of the matrix $X$. Our main findings have the following implication for applied LLM tasks: for any super-large feature dimension, we can reduce it down to a size nearly linear in the length of the sentence.
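
    The flavor of the randomized result can be illustrated with a plain Gaussian random projection standing in for the paper's sketching construction; that substitution is an assumption for illustration and does not reproduce the $\widetilde{O}(\mathrm{nnz}(X) + n^{\omega})$ running-time guarantee. The sketch below only checks numerically that the entrywise error of the softmax-normalized attention matrix stays small when $d$ is compressed to $m = O(n \log n)$.

```python
import numpy as np

def softmax_attention(Z):
    """Row-normalized attention matrix D(Z)^{-1} exp(Z Z^T)."""
    A = np.exp(Z @ Z.T)
    return A / A.sum(axis=1, keepdims=True)

def sketch_features(X, m, rng):
    """Compress X in R^{n x d} to Y = X S in R^{n x m} with a Gaussian sketch S,
    so that Y Y^T approximates X X^T in expectation (JL-style)."""
    d = X.shape[1]
    S = rng.normal(scale=1.0 / np.sqrt(m), size=(d, m))
    return X @ S

# Illustrative sizes with d >> n; entries scaled so that ||X X^T||_inf < 0.1.
rng = np.random.default_rng(0)
n, d = 32, 4096
X = rng.normal(scale=0.05 / np.sqrt(d), size=(n, d))

m = int(n * np.log(n))          # target dimension m = O(n log n), far smaller than d
Y = sketch_features(X, m, rng)

err = np.abs(softmax_attention(Y) - softmax_attention(X)).max()
print(m, err)                   # entrywise error of the normalized attention matrix
```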

    An Over-parameterized Exponential Regression

    Over the past few years, there has been a significant amount of research focused on studying the ReLU activation function, with the aim of achieving neural network convergence through over-parametrization. However, recent developments in the field of Large Language Models (LLMs) have sparked interest in the use of exponential activation functions, specifically in the attention mechanism. Mathematically, we define the neural function $F: \mathbb{R}^{d \times m} \times \mathbb{R}^d \rightarrow \mathbb{R}$ using an exponential activation function. We are given a set of data points with labels $\{(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)\} \subset \mathbb{R}^d \times \mathbb{R}$, where $n$ denotes the number of data points. Here $F(W(t),x)$ can be expressed as $F(W(t),x) := \sum_{r=1}^m a_r \exp(\langle w_r, x \rangle)$, where $m$ represents the number of neurons and $w_r(t)$ are the weights at time $t$. It is standard in the literature that the $a_r$ are fixed weights that are never changed during training. We initialize the weights $W(0) \in \mathbb{R}^{d \times m}$ with random Gaussian distributions, such that $w_r(0) \sim \mathcal{N}(0, I_d)$, and initialize $a_r$ from the random sign distribution for each $r \in [m]$. Using the gradient descent algorithm, we can find weights $W(T)$ such that $\| F(W(T), X) - y \|_2 \leq \epsilon$ holds with probability $1-\delta$, where $\epsilon \in (0,0.1)$ and $m = \Omega(n^{2+o(1)}\log(n/\delta))$. To optimize the over-parameterization bound $m$, we employ several tight analysis techniques from previous studies [Song and Yang, arXiv 2019; Munteanu, Omlor, Song and Woodruff, ICML 2022].
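
    The training setup in this abstract (exponential activation, Gaussian-initialized $w_r$, frozen random-sign $a_r$, gradient descent on the squared loss) can be sketched as follows. The width, learning rate, data normalization, and $1/\sqrt{m}$ output scaling are illustrative assumptions, not the paper's bounds or its convergence proof.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 20, 10, 1024       # data points, input dimension, width (illustrative)
lr, steps = 0.5, 2000

# Illustrative data: unit-norm inputs and bounded labels.
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
y = rng.uniform(-1.0, 1.0, size=n)

# Initialization as in the abstract: w_r ~ N(0, I_d), a_r from random signs (kept fixed).
W = rng.normal(size=(d, m))
a = rng.choice([-1.0, 1.0], size=m) / np.sqrt(m)    # the 1/sqrt(m) scaling is assumed

def predict(W):
    """F(W, x) = sum_r a_r * exp(<w_r, x>), evaluated for every row x of X."""
    return np.exp(X @ W) @ a                        # shape (n,)

for _ in range(steps):
    residual = predict(W) - y                       # (n,)
    # d/dw_r of (1/2)||F - y||^2 = sum_i residual_i * a_r * exp(<w_r, x_i>) * x_i
    grad = X.T @ (np.exp(X @ W) * (residual[:, None] * a[None, :]))
    W -= (lr / n) * grad

print(np.linalg.norm(predict(W) - y))               # training residual should shrink
```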